class: inverse, center, title-slide, middle <style> .title-slide .remark-slide-number { display: none; } </style> # .title-wrap[Intro to Programming with R for Political Scientists] <br /> ## .header-fancy[Session 3: Data Wrangling] ### Markus Freitag ### Geschwister Scholl Institute of Political Science, LMU ### [<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#415564;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg>](https://twitter.com/MarkusGFreitag) [<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#415564;" xmlns="http://www.w3.org/2000/svg"> <path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"></path></svg>](https://markusfreitag.netlify.app/) ### 2021-07-03 <a href="https://github.com/m-freitag" class="github-corner" aria-label="View source on Github"><svg width="80" height="80" viewBox="0 0 250 250" style="fill:#415564; color:#f6f3f2; position: absolute; top: 0; border: 0; right: 0;" aria-hidden="true"><path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path><path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2" fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path><path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z" fill="currentColor" class="octo-body"></path></svg></a><style>.github-corner:hover .octo-arm{animation:octocat-wave 560ms ease-in-out}@keyframes octocat-wave{0%,100%{transform:rotate(0)}20%,60%{transform:rotate(-25deg)}40%,80%{transform:rotate(10deg)}}@media (max-width:500px){.github-corner:hover .octo-arm{animation:none}.github-corner .octo-arm{animation:octocat-wave 560ms ease-in-out}}</style> --- # Overview 1. Intro 2. R-Studio and (Git)Hub 3. Base R & Tidyverse Basics 4. .hl[Data Wrangling] 5. Data Viz 6. Writing Functions 7. A complete scientific workflow with R --- # Workflow - Navigate to `Session Scripts > Session 3` and open `Session_3_script.R`. - You will see a pre-formatted Script with all the steps I do on the slides. - Explore as you follow. - If you have a second monitor, great! If not, split your screen. --- # Tidyverse Packages <img src="data:image/png;base64,#Figs/tidyverse.png" width="45%" style="display: block; margin: auto;" /> - These are only the tidyverse packages loaded by default. There are [more](https://tidyverse.tidyverse.org/) (e.g. [lubridate](https://github.com/tidyverse/lubridate)). --- # The Philosophy: Tidy Data > 1. Each variable forms a column. > 2. Each observation forms a row. > 3. Each type of observational unit forms a table. [(Wickham 2014)](https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf) - Sounds pretty reasonable. 3. is related to constructing clean relational data bases (we will see an example later). - However, we can relax this from time to time given our needs/statistical methods. .hl2[Messy data] is everything else, e.g.: .code60[ ``` ## race character gender age_0-100 age_100-500 age_500-100 age_>1000 ## 1 hobbits Frodo male 1 0 0 0 ## 2 hobbits Sam male 1 0 0 0 ## 3 elves Arwen female 0 0 0 1 ## 4 hobbits Golum male 0 0 1 0 ## 5 dwarves Gimli male 0 1 0 0 ## 6 men Eowyn female 1 0 0 0 ``` ] --- # The Philosophy: Tidy Data Tidy: ``` ## race character gender age_cat ## 1 hobbits Frodo male 0-100 ## 2 hobbits Sam male 0-100 ## 3 elves Arwen female >1000 ## 4 hobbits Golum male 500-100 ## 5 dwarves Gimli male 100-500 ## 6 men Eowyn female 0-100 ``` --- # Tibbles - Tibbles are data.frames but with some perks. For instance, subset more strictly. - They also print differently (better): ```r as_tibble(lotr) ``` ``` ## # A tibble: 6 x 4 ## race character gender age_cat ## <chr> <chr> <chr> <chr> ## 1 hobbits Frodo male 0-100 ## 2 hobbits Sam male 0-100 ## 3 elves Arwen female >1000 ## 4 hobbits Golum male 500-100 ## 5 dwarves Gimli male 100-500 ## 6 men Eowyn female 0-100 ``` ```r # Hint: You can create one just like data frames but with tibble(). ``` --- # Importing/Exporting Data - R comes with it's own two file formats, `.rds` (single objects) and `.rda` (multiple objects/tabular data). - However, for saving the latter you will (and should) use `.csv` or `.json` (human readable) most of the time. -- - For R novices, importing and exporting can be a bit of a pain. - There are different functions/packages for reading different file formats (haven, data.table, readxl etc.). - Thankfully, the [rio](https://github.com/leeper/rio) package by Thomas Leeper and Chung-hong Chan et al. makes our life easier. - Provides `export()` and `import` as wrappers fot the above mentioned packages. ```r install.packages("rio") ``` --- # Importing/Exporting Data ```r library(rio) # Export export(mtcars, "mtcars.csv") # R's built-in mtcars data-set. export(mtcars, "mtcars.rds") # R serialized export(mtcars, "mtcars.dta") # Stata export(mtcars, "mtcars.json") # Import W <- import("mtcars.csv") X <- import("mtcars.rds") Y <- import("mtcars.dta") Z <- import("mtcars.json") ``` --- # Importing/Exporting Data .code70[ ```r # Exporting/importing several data frames: export_list()/import_list() # Make a list of two built-in data sets. # tibble::lst() automatically names the elements: df_list <- tibble::lst(mtcars, iris) export_list(df_list, file = paste0(names(df_list), ".csv")) # export_file takes a character vector; hence, we build one from the names of our element # With the paste0() we paste ".csv" to every element of the character vector # produced by names(df_list). Z <- import_list(dir(pattern = "csv$")) # import_file takes achr vector holding file paths/files. # With dir() we get all names of the files that match a specific pattern (regular expression). # In this case, all files that end with csv ($ matches the end of the string).[1] ``` ] .font70[<sup>[1]</sup> .hl[Fine Point:] [Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) were developed in computer science to specify search patterns/make character matching possible. They are very useful (e.g. for manipulating/cleaning textual data) but pretty hard to memorize. Even if you know the basics, you will google alot. In R, I recommend the [stringi/stringr](https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html) packages.] --- # Pipes: %>% and |> - Pipes are crucial to the tidyverse workflow but also in general. - They make code more readable. Idea: Take the output of a function and to pass it **as the first argument** of another function. `$$f(g(x)) \quad \text{becomes (in R)} \quad x \quad \text{%>%}^{[2]} \quad g() \quad \text{%>%} \quad f()$$` -- .code80[ ```r x <- c(1, 2, 3, 4) sqrt(mean(x)) x %>% mean() %>% sqrt() ``` ] .font70[<sup>[2]</sup> .hl[TIPP:] Press ctrl-shift-m to produce the [magrittr](https://magrittr.tidyverse.org/) pipe, `%>%`, in RStudio.] --- # Pipes: %>% and |> - We can also pass the output as the second, third, ... argument using `.` as a placeholder. - `f(a, b = c)` can be written as `c %>% f(a, b = .)` Example: ```r "Ceci n%est pas une pipe" %>% gsub("%", "'", .) # gsub() performs replacement of all matches in a chr. vector. ``` ``` ## [1] "Ceci n'est pas une pipe" ``` -- - They are so useful, the R core team even introduced a base pipe (`|>`), in May 2021. Something like this happen **very** rarely: ```r x |> mean() |> sqrt() ``` --- # Pipes: %>% and |> - The base pipe is a [tiny](https://statistik-dresden.de/archives/17227) bit faster. - Does not need an extra dependency. Useful for package development (using the magrittr pipe in packages can be annoying sometimes). -- - .hl[BUT]: it does not (yet) properly support the `.` placeholder (as of June 2021). We can hack it tho: ```r Sys.setenv("_R_USE_PIPEBIND_" = "true") "Ceci n%est pas une pipe" |> . => gsub("%", "'", .) ``` ``` ## [1] "Ceci n'est pas une pipe" ``` - That's pretty ugly and [apparently](https://twitter.com/rdataberlin/status/1396170545046097928) still buggy. We will stick to `%>%`. --- class: inverse, center, middle name: intro # Let's Wrangle --- # The Data - We will use the [parlgov](http://www.parlgov.org/) database: > ParlGov is a data infrastructure for political science and contains information for all EU and most OECD democracies (37 countries). The database combines approximately 1700 parties, 1000 elections (9400 results), and 1600 cabinets (3900 parties). - It's relational, i.e. consists of different tables (parties, elections, cabinets) that can be **joined** using key variables. It can also be joined with the [partyfacts](https://partyfacts.herokuapp.com/) dataset that provides id's for many other datasets (e.g. CLEA, ESS). --- # The Data <img src="data:image/png;base64,#Figs/relational.svg" width="60%" style="display: block; margin: auto;" /> --- # The Data - We will use the [parlgov](http://www.parlgov.org/) database: > ParlGov is a data infrastructure for political science and contains information for all EU and most OECD democracies (37 countries). The database combines approximately 1700 parties, 1000 elections (9400 results), and 1600 cabinets (3900 parties). - It's relational, i.e. consists of different tables (parties, elections, cabinets) that can be **joined** using key variables. It can also be joined with the [partyfacts](https://partyfacts.herokuapp.com/) dataset that provides id's for many other datasets (e.g. CLEA, ESS). - It makes for pretty simple examples and hence we use it. Let's import the election data: ```r parlgov_elec <- import("http://www.parlgov.org/static/data/development-cp1252/view_election.csv") ``` --- # The Data: Getting an Overview .code70[ ```r glimpse(parlgov_elec) # enhanced version of str() ``` ``` ## Rows: 8,665 ## Columns: 16 ## $ country_name_short <chr> "AUS", "AUS", "AUS", "AUS", "AUS", "AU~ ## $ country_name <chr> "Australia", "Australia", "Australia",~ ## $ election_type <chr> "parliament", "parliament", "parliamen~ ## $ election_date <date> 1901-03-30, 1901-03-30, 1901-03-30, 1~ ## $ vote_share <dbl> 44.4, 34.2, 19.4, 1.4, 0.6, 29.7, 34.4~ ## $ seats <int> 32, 26, 15, 1, 1, 26, 25, 23, 1, 0, 27~ ## $ seats_total <int> 75, 75, 75, 75, 75, 75, 75, 75, 75, 75~ ## $ party_name_short <chr> "PP", "FTP", "ALP", "none", "one-seat"~ ## $ party_name <chr> "Protectionist Party", "Free Trade Par~ ## $ party_name_english <chr> "Protectionist Party", "Free Trade Par~ ## $ left_right <dbl> 7.4000, 6.0000, 3.8833, NA, NA, 7.4000~ ## $ country_id <int> 33, 33, 33, 33, 33, 33, 33, 33, 33, 33~ ## $ election_id <int> 731, 731, 731, 731, 731, 730, 730, 730~ ## $ previous_parliament_election_id <int> NA, NA, NA, NA, NA, 731, 731, 731, 731~ ## $ previous_cabinet_id <int> NA, NA, NA, NA, NA, 997, 997, 997, 997~ ## $ party_id <int> 1898, 1938, 1253, 1396, 2299, 1898, 19~ ``` ] --- # The Data: Getting an Overview
.hl[Q]: Is this data set tidy? --- # The Data: Getting an Overview .code70[ ```r datasummary_skim(parlgov_elec) # from modelsummary package. Set type = "categorical" for character vars. # Of course, not super informative in our hierarchical data set: ``` <table class="table table" style="width: auto !important; margin-left: auto; margin-right: auto; font-size: 17px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Unique (#) </th> <th style="text-align:right;"> Missing (%) </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> SD </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Max </th> <th style="text-align:right;"> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> vote_share </td> <td style="text-align:right;"> 2459 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 11.9 </td> <td style="text-align:right;"> 12.9 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 6.6 </td> <td style="text-align:right;"> 71.2 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="3.22" width="3.12" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.90" y="8.10" width="3.12" height="3.56" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="8.02" y="9.69" width="3.12" height="1.98" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.14" y="10.42" width="3.12" height="1.25" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.26" y="10.63" width="3.12" height="1.03" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.38" y="10.76" width="3.12" height="0.90" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.50" y="10.94" width="3.12" height="0.72" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.63" y="11.03" width="3.12" height="0.63" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.75" y="11.12" width="3.12" height="0.54" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="29.87" y="11.30" width="3.12" height="0.36" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="32.99" y="11.54" width="3.12" height="0.12" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="36.11" y="11.63" width="3.12" height="0.031" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.23" y="11.66" width="3.12" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.35" y="11.66" width="3.12" height="0.0048" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="45.47" y="11.66" width="3.12" height="0.0024" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> seats </td> <td style="text-align:right;"> 291 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 23.3 </td> <td style="text-align:right;"> 45.7 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 6.0 </td> <td style="text-align:right;"> 470.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="3.22" width="4.73" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="6.51" y="10.93" width="4.73" height="0.73" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.23" y="11.42" width="4.73" height="0.24" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.96" y="11.53" width="4.73" height="0.13" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.69" y="11.60" width="4.73" height="0.058" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="25.42" y="11.60" width="4.73" height="0.060" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.15" y="11.64" width="4.73" height="0.025" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.87" y="11.65" width="4.73" height="0.010" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.60" y="11.66" width="4.73" height="0.0034" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="44.33" y="11.66" width="4.73" height="0.0011" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> seats_total </td> <td style="text-align:right;"> 209 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 212.2 </td> <td style="text-align:right;"> 181.3 </td> <td style="text-align:right;"> 5.0 </td> <td style="text-align:right;"> 151.0 </td> <td style="text-align:right;"> 709.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.46" y="4.82" width="3.16" height="6.84" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.62" y="4.24" width="3.16" height="7.43" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="7.78" y="3.76" width="3.16" height="7.90" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="10.93" y="3.22" width="3.16" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.09" y="8.73" width="3.16" height="2.94" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.24" y="10.52" width="3.16" height="1.14" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.40" y="9.38" width="3.16" height="2.28" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.56" y="11.29" width="3.16" height="0.37" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.71" y="11.42" width="3.16" height="0.24" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="29.87" y="9.27" width="3.16" height="2.39" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.03" y="10.91" width="3.16" height="0.75" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="36.18" y="10.50" width="3.16" height="1.16" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.34" y="8.92" width="3.16" height="2.74" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.50" y="11.37" width="3.16" height="0.29" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="45.65" y="11.55" width="3.16" height="0.11" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> left_right </td> <td style="text-align:right;"> 413 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 5.1 </td> <td style="text-align:right;"> 2.4 </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 5.7 </td> <td style="text-align:right;"> 9.8 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="10.29" width="4.52" height="1.38" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="6.30" y="6.81" width="4.52" height="4.85" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="10.83" y="7.42" width="4.52" height="4.24" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.35" y="3.67" width="4.52" height="7.99" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="19.87" y="7.86" width="4.52" height="3.80" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="24.40" y="3.79" width="4.52" height="7.88" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="28.92" y="5.30" width="4.52" height="6.37" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.44" y="3.22" width="4.52" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.97" y="7.51" width="4.52" height="4.15" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.49" y="10.53" width="4.52" height="1.13" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> country_id </td> <td style="text-align:right;"> 37 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 37.2 </td> <td style="text-align:right;"> 20.5 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 37.0 </td> <td style="text-align:right;"> 75.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.18" y="8.46" width="3.00" height="3.20" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.18" y="3.22" width="3.00" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="7.18" y="8.54" width="3.00" height="3.12" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="10.19" y="11.03" width="3.00" height="0.63" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="13.19" y="5.87" width="3.00" height="5.79" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="16.19" y="4.48" width="3.00" height="7.19" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="19.20" y="4.26" width="3.00" height="7.40" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="22.20" y="5.13" width="3.00" height="6.53" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="25.20" y="3.64" width="3.00" height="8.02" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="28.20" y="11.66" width="3.00" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="31.21" y="6.85" width="3.00" height="4.81" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.21" y="7.03" width="3.00" height="4.63" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.21" y="5.09" width="3.00" height="6.57" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="40.22" y="7.74" width="3.00" height="3.93" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="43.22" y="8.76" width="3.00" height="2.90" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> election_id </td> <td style="text-align:right;"> 998 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 569.5 </td> <td style="text-align:right;"> 320.2 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 583.0 </td> <td style="text-align:right;"> 1094.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.74" y="5.51" width="4.07" height="6.15" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="5.80" y="5.49" width="4.07" height="6.17" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.87" y="5.70" width="4.07" height="5.96" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="13.94" y="5.60" width="4.07" height="6.06" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="18.00" y="5.77" width="4.07" height="5.89" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="22.07" y="4.99" width="4.07" height="6.67" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.13" y="5.23" width="4.07" height="6.43" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.20" y="4.27" width="4.07" height="7.40" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.27" y="4.97" width="4.07" height="6.70" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="38.33" y="6.08" width="4.07" height="5.58" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.40" y="3.22" width="4.07" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> previous_parliament_election_id </td> <td style="text-align:right;"> 794 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 543.0 </td> <td style="text-align:right;"> 301.5 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 563.0 </td> <td style="text-align:right;"> 1079.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.74" y="6.12" width="4.12" height="5.54" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="5.86" y="5.51" width="4.12" height="6.16" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.98" y="6.04" width="4.12" height="5.63" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.11" y="6.28" width="4.12" height="5.39" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="18.23" y="5.48" width="4.12" height="6.18" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="22.35" y="5.23" width="4.12" height="6.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.47" y="4.97" width="4.12" height="6.69" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.60" y="3.22" width="4.12" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.72" y="7.09" width="4.12" height="4.57" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="38.84" y="5.95" width="4.12" height="5.71" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.97" y="7.26" width="4.12" height="4.40" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> previous_cabinet_id </td> <td style="text-align:right;"> 833 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 752.0 </td> <td style="text-align:right;"> 468.0 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 734.0 </td> <td style="text-align:right;"> 1634.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.72" y="3.22" width="2.72" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.45" y="3.54" width="2.72" height="8.13" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="7.17" y="4.06" width="2.72" height="7.60" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.89" y="5.89" width="2.72" height="5.77" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="12.62" y="4.39" width="2.72" height="7.27" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.34" y="4.72" width="2.72" height="6.94" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="18.06" y="5.27" width="2.72" height="6.39" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.79" y="4.10" width="2.72" height="7.57" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.51" y="3.93" width="2.72" height="7.73" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.23" y="5.87" width="2.72" height="5.80" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="28.96" y="3.37" width="2.72" height="8.29" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="31.68" y="5.88" width="2.72" height="5.78" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.40" y="6.99" width="2.72" height="4.67" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.13" y="9.05" width="2.72" height="2.61" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.85" y="7.32" width="2.72" height="4.34" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.57" y="4.05" width="2.72" height="7.62" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="45.30" y="9.85" width="2.72" height="1.81" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> party_id </td> <td style="text-align:right;"> 1559 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1120.3 </td> <td style="text-align:right;"> 738.8 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 1015.0 </td> <td style="text-align:right;"> 2812.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewbox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clippath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clippath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clippath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.75" y="4.10" width="3.16" height="7.56" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.91" y="3.22" width="3.16" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="8.07" y="4.25" width="3.16" height="7.41" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.24" y="3.25" width="3.16" height="8.42" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.40" y="3.70" width="3.16" height="7.97" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.56" y="4.39" width="3.16" height="7.27" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.73" y="3.44" width="3.16" height="8.22" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.89" y="4.19" width="3.16" height="7.48" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="27.05" y="7.92" width="3.16" height="3.74" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.22" y="10.00" width="3.16" height="1.66" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.38" y="8.78" width="3.16" height="2.88" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="36.54" y="8.14" width="3.16" height="3.52" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.71" y="8.06" width="3.16" height="3.60" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.87" y="8.56" width="3.16" height="3.10" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="46.03" y="11.54" width="3.16" height="0.12" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> </tbody> </table> ] .font70[.hl[TIPP:] The [collapse](https://github.com/SebKrantz/collapse) package also provides some data summary functions (`descr()` and `qsu()`) that work really well on huge data sets. Stata users will love it as its similar to summarize.] --- # Tidyverse's Wrangling "Wunderkind": dplyr <img src="data:image/png;base64,#https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/dplyr_wrangling.png" width="40%" style="display: block; margin: auto;" /> <center> .font70[© [Alison Horst](https://github.com/allisonhorst/stats-illustrations)] <center/> A full list of `dplyr` functions can be found [here](https://dplyr.tidyverse.org/reference/index.html). --- # Data Masking - Before we jump right in, remember when we talked about R environments in the last session? - When we wanted to pass a variable of a data frame (or element of a list) to a function, we needed to make that .hl[explicit], e.g., using `$` for subsetting. ```r lm(df$y ~ df$x) ``` - Some important dplyr functions use [data masking](https://dplyr.tidyverse.org/articles/programming.html) such that you can refer to variables just like objects in the (global) envornment. - Get's rid of .hl[$]s. --- # Filtering Rows - Let's start with something very basic: filtering rows. - For instance, we might want to obtain only the results of all Bundestag elections: .code70[ ```r parlgov_elec_de <- parlgov_elec %>% # add, e.g., _de if we want to keep our original df filter(country_name_short == "DEU") ``` ] -- - Now, say we want the mean SPD vote share across all bundestag elections (bit silly but alas): .code60[ ```r datasummary_skim(parlgov_elec_de$election_type, type = "categorical") parlgov_elec_de %>% # add, e.g., _de if we want to keep our original df filter(party_name_short == "SPD", election_type == "parliament") %>% mean(vote_share) ``` ] - That does not work....hl[Q]: Why? .font70[.hl[Note] that using convenience wrappers such as `datasummary()` are no substitute for learning the underlying base, tidyverse or data.table-way of achieving this.] --- # summarise() and pull() - Base-R's summary stats functions can't take a df or tibble and we did not wrap it into a dplyr function (.hl[in order for data masking to kick in]). Two options: .pull-left[ 1\. We can use `pull()` to pull out a vector form a df; works like `$` but for pipes. ```r parlgov_elec_de %>% # add, e.g., _de if we want to keep our original df filter(party_name_short == "SPD", election_type == "parliament") %>% pull(vote_share) %>% mean() ``` ``` ## [1] 31.70964 ``` ] -- .pull-right[ 2\. We can wrap `mean()`into `summarise()` ("data-masking" for built-in summary functions): ```r parlgov_elec_de %>% filter(party_name_short == "SPD", election_type == "parliament") %>% summarise(mean(vote_share)) # summarise() takes summary functions such as mean(), sd(), etc. ``` ``` ## mean(vote_share) ## 1 31.70964 ``` ] --- # summarise() - Puts out a tibble we can pass to other dplyr functions. - We can also compute more summary statistics with `summarise()` .hl[AND] also turn them into variables: ```r parlgov_elec_de %>% filter(party_name_short == "SPD", election_type == "parliament") %>% summarise(mean = mean(vote_share), sd = sd(vote_share), n = n()) ``` ``` ## mean sd n ## 1 31.70964 8.374648 28 ``` ```r # Aside: you can also get the number of rows (in a group) by usining the base # function nrow(.) with a placeholder in the above pipe. ``` - In that, it is somewhat similar to the `mutate()` command... but it .hl2[collapses the data]. --- # mutate() - With `mutate()` you can add columns - or overwrite existing ones (with the same name) - based on transformations of other variables - For instance, say we want to add a new column "year" based on the election date variable: ```r parlgov_elec_de <- parlgov_elec_de %>% # here, we "overwrite" our df mutate(year = lubridate::year(election_date)) ``` -- - Here, we used the `year()` function of the [lubridate](https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf) package. - We pulled out an integer vector indicating the election year from the `date` variable and assigned it to a new variable.<sup>[3]<sup/> - `transmute()` is the opposite of `mutate()`; i.e. adds a new variable and drops the rest. .font70[ <sup>[3]<sup/>] .font70[.hl[NOTE] We will use some core `lubridate` functions on the fly as they are rather intuitive. If you deal with dates often, give [this](https://r4ds.had.co.nz/dates-and-times.html) intro a read.] --- # select() and arrange() - Another basic wrangling operation we occasionally need to do do is selecting columns. - And sometimes we also want to sort columns: .code70[ ```r parlgov_elec_de %>% filter(year == 2017) %>% select(party_name_short, vote_share, left_right) %>% arrange(desc(left_right)) #default is ascending; we can wrap the masked vector with desc() to sort descending ``` ``` ## party_name_short vote_share left_right ## 1 AfD 12.6 8.8000 ## 2 FW 1.0 7.4000 ## 3 CSU 6.2 7.2871 ## 4 CDU 26.8 6.2503 ## 5 FDP 10.7 5.9233 ## 6 SPD 20.5 3.6451 ## 7 B90/Gru 8.9 2.9308 ## 8 PDS|Li 9.2 1.2152 ## 9 PARTEI 1.0 NA ``` ] --- # group_by() - The last of the most fundamental dplyr "verbs" we will learn is `group_by`. - Given the variables you supply, it converts a data frame into a grouped version of it such that you can do transformations, call functions, manipulate variables, etc. - For instance, let's use the whole data to get the the party with the most votes ever in a parliamentary election by country... --- # group_by() .code60[ ```r parlgov_elec %>% filter(election_type == "parliament") %>% group_by(country_name_short) %>% summarise(share_max = max(vote_share, na.rm = T), party = party_name_english[1], election_date = election_date[1] ) %>% # note the "collapsing" I mentioned earlier arrange(desc(share_max)) ``` ``` ## # A tibble: 37 x 4 ## country_name_short share_max party election_date ## <chr> <dbl> <chr> <date> ## 1 CYP 71.2 Democratic Party 1976-09-05 ## 2 LVA 68.2 Popular Front of Latvia 1990-04-29 ## 3 ROU 66.3 Democratic Party 1990-05-20 ## 4 MLT 59.9 Malta Labour Party 1947-10-27 ## 5 JPN 59 Japan Liberal Party 1946-04-10 ## 6 NZL 58.7 New Zealand Liberal Party 1902-11-25 ## 7 CAN 57 Liberal Party of Canada 1900-11-07 ## 8 CHE 56.2 Radical Democratic Party 1902-10-26 ## 9 BEL 56 Catholic Party 1900-05-27 ## 10 GBR 55.0 Conservatives 1918-12-14 ## # ... with 27 more rows ``` ] --- # group_by() - Too complex? - Ok, let's do something more simple and add a variable that tells us the maximum number of total seats for each country. -- ```r parlgov_elec <- parlgov_elec %>% group_by(country_id) %>% mutate(max_seats = max(seats_total)) %>% ungroup() # to remove the grouping ``` --- # group_by() .code70[ ```r parlgov_elec %>% filter(seats_total == max_seats) %>% select(country_name_short, election_date, max_seats) %>% distinct() # select distinct rows ``` ``` ## # A tibble: 309 x 3 ## country_name_short election_date max_seats ## <chr> <date> <int> ## 1 AUS 2019-05-18 151 ## 2 AUT 1920-10-17 183 ## 3 AUT 1971-10-10 183 ## 4 AUT 1975-10-05 183 ## 5 AUT 1979-05-06 183 ## 6 AUT 1983-04-24 183 ## 7 AUT 1986-11-23 183 ## 8 AUT 1990-10-07 183 ## 9 AUT 1994-10-09 183 ## 10 AUT 1995-12-17 183 ## # ... with 299 more rows ``` ] --- # Other usefull stuff - `tidyr::separate()` to separate entries in columns; `tidyr::unite` to unite columns - `dplyr::rename()` to rename variables. - `dplyr::count()` count unique values of one or multiple variables. - `dplyr::distinct()` or Base R `unique()` to remove duplicate rows - `dplyr::slice()` to get a slice of rows: `parlgov_elec %>% slice(1:5)` to get the first 5 rows. - `dplyr::across()` apply functions to multiple columns. .font70[.hl[TIPP] Check out the [`modelsummary`](https://vincentarelbundock.github.io/modelsummary/articles/datasummary.html) functions in more detail for easy crosstabs. ALternatively, use [`janitor::tabyl`](https://github.com/sfirke/janitor).] --- class: inverse, center, middle name: intro # Session 3 - Problem Set --- class: inverse, center, middle name: intro # Advanced Wrangling --- # Dealing with factor variables: `forcats` - Factor variables are useful; especially for plotting and modelling. - With `factor_recode` we can easily recode levels: ```r parlgov_elec_de %>% mutate(election_type = fct_recode(election_type, # Coerces the type automatically from chr to fct. Bundestagswahl = "parliament", Europawahl = "ep" )) %>% count(election_type) ``` ``` ## election_type n ## 1 Europawahl 82 ## 2 Bundestagswahl 264 ``` - With `fct_recorder` we can reorder factors (will be usefull for plotting factors). --- # Complex conditions: `if_else` and `case_when` - Often, we also want to manipulate variables by means of complex conditions - We will go deeper into control flow statements next week, but here is a sneak preview for data wrangling. - Say we want to create a variable, "family", that puts parties into some party family based on some arbitrary cutoff of the time-invariant left_right position: ```r parlgov_elec_de <- parlgov_elec_de %>% mutate(family = if_else(left_right > 5, "right", "left")) ``` - Vectorised if: `if_else(condition, true, false)`. --- # Complex conditions: `if_else` and `case_when` - A generalised version of `if_else` is `case_when`. - This is .hl2[very] useful: ```r parlgov_elec_de <- parlgov_elec_de %>% mutate(family = case_when( left_right <= 2.5 ~ "left", left_right > 2.5 & left_right < 5 ~ "centre-left", left_right > 5 & left_right < 7.5 ~ "centre-right", left_right >= 7.5 ~ "right")) ``` - two-sided formula: LHS = logical test; RHS = value to assign if the test is `TRUE`. - Values that do not fall into any of the conditions become `NA` which can be prevented by adding `TRUE ~ something` as the last argument. --- # tidyr: reshaping data - Reshaping data is one of the key things you need to when cleaning/analysing data. - [Two functions](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf): - `tidyr::pivot_wider/longer()` is for reshaping from long (wide) to wide (long). Two main arguments: - `names_*` and `values_*`, where "*" is "to" for `pivot_wider()` and "from" for `pivot_longer()`. -- .pull-left[ <img src="data:image/png;base64,#Figs/wide.png" width="60%" style="display: block; margin: auto;" /> <center> long → wide: `pivot_wider()` <center/> ] .pull-right[ <img src="data:image/png;base64,#Figs/longer.png" width="60%" style="display: block; margin: auto;" /> <br/> <center> wide → long: `pivot_longer()` <center/> ] --- # tidyr: reshaping data .hl[Example:] - Say, we want a table of the vote shares of all major parties for each post-WW2 parliamentary election. - Where each row is an election: ```r wide <- parlgov_elec_de %>% filter(election_type == "parliament", vote_share >= 5, year(election_date) >= 1945) %>% select(election_date, party_name_short, vote_share) %>% pivot_wider(names_from = party_name_short, values_from = vote_share) ``` --- # tidyr: reshaping data
--- # tidyr: reshaping data - We can revert back to long format: ```r long <- wide %>% pivot_longer(!election_date, names_to = "party_name_short", values_to = "vote_share") %>% filter(is.na(vote_share) == FALSE) # alternatively, simply set values_drop_na to TRUE in pivot_longer(). head(long) ``` ``` ## # A tibble: 6 x 3 ## election_date party_name_short vote_share ## <date> <chr> <dbl> ## 1 1949-08-14 SPD 29.2 ## 2 1949-08-14 CDU 25.2 ## 3 1949-08-14 FDP 11.9 ## 4 1949-08-14 CSU 5.8 ## 5 1949-08-14 KPD 5.7 ## 6 1953-09-06 SPD 28.8 ``` --- # dplyr: joins - Let's come back to the relational nature of our data... --- # The Data <img src="data:image/png;base64,#Figs/relational.svg" width="60%" style="display: block; margin: auto;" /> --- # dplyr: joins - Let's come back to the [relational](https://en.wikipedia.org/wiki/Relational_database) nature of our data... - Remember, they consist of different tables, each representing one "entity type" (c.f. the 3rd. point in Wickham's tidy data framework) - Each table has a unique key, representing each row. This key variable is used to link/join tables - Suppose we want to join the party and the election table, how do we do that? --- # dplyr: joins <img src="data:image/png;base64,#Figs/joins.png" width="35%" style="display: block; margin: auto;" /> <img src="data:image/png;base64,#Figs/fjoins.png" width="35%" style="display: block; margin: auto;" /> .hl2[Q] Which join do we need? --- # dplyr: joins We want a left join here... .code70[ ```r parlgov_party <- rio::import("http://www.parlgov.org/static/data/development-utf-8/view_party.csv") l_joined <- left_join(parlgov_elec_de, parlgov_party, by = "party_id") head(l_joined) ``` ``` ## country_name_short.x country_name.x election_type election_date vote_share ## 1 DEU Germany parliament 1919-01-19 37.87 ## 2 DEU Germany parliament 1919-01-19 18.32 ## 3 DEU Germany parliament 1919-01-19 15.45 ## 4 DEU Germany parliament 1919-01-19 10.26 ## 5 DEU Germany parliament 1919-01-19 4.66 ## 6 DEU Germany parliament 1919-01-19 7.63 ## seats seats_total party_name_short.x ## 1 165 423 SPD ## 2 74 423 DDP ## 3 73 423 DZ ## 4 41 423 DNVP ## 5 23 423 DVP ## 6 22 423 USPD ## party_name.x ## 1 Sozialdemokratische Partei Deutschlands ## 2 Deutsche Demokratische Partei ## 3 Deutsche Zentrumspartei ## 4 Deutschnationale Volkspartei ## 5 Deutsche Volkspartei ## 6 Unabhängige Sozialdemokratische Partei Deutschland ## party_name_english.x left_right.x country_id.x ## 1 Social Democratic Party of Germany 3.6451 54 ## 2 German Democratic Party 6.0000 54 ## 3 Centre Party 6.2000 54 ## 4 German National People's Party 8.8000 54 ## 5 German People's Party 7.4000 54 ## 6 Independent Social Democratic Party of Germany 1.3000 54 ## election_id previous_parliament_election_id previous_cabinet_id party_id year ## 1 1031 NA NA 558 1919 ## 2 1031 NA NA 2691 1919 ## 3 1031 NA NA 137 1919 ## 4 1031 NA NA 2692 1919 ## 5 1031 NA NA 2693 1919 ## 6 1031 NA NA 2690 1919 ## family country_name_short.y country_name.y party_name_short.y ## 1 centre-left DEU Germany SPD ## 2 centre-right DEU Germany DDP ## 3 centre-right DEU Germany DZ ## 4 right DEU Germany DNVP ## 5 centre-right DEU Germany DVP ## 6 left DEU Germany USPD ## party_name_english.y ## 1 Social Democratic Party of Germany ## 2 German Democratic Party ## 3 Centre Party ## 4 German National People's Party ## 5 German People's Party ## 6 Independent Social Democratic Party of Germany ## party_name.y ## 1 Sozialdemokratische Partei Deutschlands ## 2 Deutsche Demokratische Partei ## 3 Deutsche Zentrumspartei ## 4 Deutschnationale Volkspartei ## 5 Deutsche Volkspartei ## 6 Unabhängige Sozialdemokratische Partei Deutschland ## party_name_ascii family_name_short ## 1 Sozialdemokratische Partei Deutschlands soc ## 2 Deutsche Demokratische Partei lib ## 3 Deutsche Zentrumspartei chr ## 4 Deutschnationale Volkspartei right ## 5 Deutsche Volkspartei con ## 6 Unabhaengige Sozialdemokratische Partei Deutschland com ## family_name left_right.y state_market liberty_authority eu_anti_pro ## 1 Social democracy 3.6451 3.8443 3.9791 7.9318 ## 2 Liberal 6.0000 6.7000 3.6000 8.7000 ## 3 Christian democracy 6.2000 5.7000 7.1000 8.3000 ## 4 Right-wing 8.8000 5.9000 8.5000 2.3000 ## 5 Conservative 7.4000 6.4000 7.0000 7.9000 ## 6 Communist/Socialist 1.3000 1.4000 3.0000 3.3000 ## cmp euprofiler ees castles_mair huber_inglehart ray benoit_laver chess ## 1 41320 120 1276320 803 1503 302 6087 302 ## 2 NA NA NA NA NA NA NA NA ## 3 41522 NA NA NA NA NA NA NA ## 4 NA NA NA NA NA NA NA NA ## 5 NA NA NA NA NA NA NA NA ## 6 NA NA NA NA NA NA NA NA ## country_id.y family_id ## 1 54 11 ## 2 54 6 ## 3 54 3 ## 4 54 40 ## 5 54 26 ## 6 54 14 ``` ] .hl[Q] Why do we get `party_name_short.x` etc? --- # dplyr: joins - There are multiple matching (by name) variables in both tables. - Hence, we need to specify all keys, or let `dplyr` do its magic: .code70[ ```r l_joined <- left_join(parlgov_elec_de, parlgov_party) head(l_joined) ``` ``` ## country_name_short country_name election_type election_date vote_share seats ## 1 DEU Germany parliament 1919-01-19 37.87 165 ## 2 DEU Germany parliament 1919-01-19 18.32 74 ## 3 DEU Germany parliament 1919-01-19 15.45 73 ## 4 DEU Germany parliament 1919-01-19 10.26 41 ## 5 DEU Germany parliament 1919-01-19 4.66 23 ## 6 DEU Germany parliament 1919-01-19 7.63 22 ## seats_total party_name_short ## 1 423 SPD ## 2 423 DDP ## 3 423 DZ ## 4 423 DNVP ## 5 423 DVP ## 6 423 USPD ## party_name ## 1 Sozialdemokratische Partei Deutschlands ## 2 Deutsche Demokratische Partei ## 3 Deutsche Zentrumspartei ## 4 Deutschnationale Volkspartei ## 5 Deutsche Volkspartei ## 6 Unabhängige Sozialdemokratische Partei Deutschland ## party_name_english left_right country_id ## 1 Social Democratic Party of Germany 3.6451 54 ## 2 German Democratic Party 6.0000 54 ## 3 Centre Party 6.2000 54 ## 4 German National People's Party 8.8000 54 ## 5 German People's Party 7.4000 54 ## 6 Independent Social Democratic Party of Germany 1.3000 54 ## election_id previous_parliament_election_id previous_cabinet_id party_id year ## 1 1031 NA NA 558 1919 ## 2 1031 NA NA 2691 1919 ## 3 1031 NA NA 137 1919 ## 4 1031 NA NA 2692 1919 ## 5 1031 NA NA 2693 1919 ## 6 1031 NA NA 2690 1919 ## family party_name_ascii family_name_short ## 1 centre-left Sozialdemokratische Partei Deutschlands soc ## 2 centre-right Deutsche Demokratische Partei lib ## 3 centre-right Deutsche Zentrumspartei chr ## 4 right Deutschnationale Volkspartei right ## 5 centre-right Deutsche Volkspartei con ## 6 left <NA> <NA> ## family_name state_market liberty_authority eu_anti_pro cmp ## 1 Social democracy 3.8443 3.9791 7.9318 41320 ## 2 Liberal 6.7000 3.6000 8.7000 NA ## 3 Christian democracy 5.7000 7.1000 8.3000 41522 ## 4 Right-wing 5.9000 8.5000 2.3000 NA ## 5 Conservative 6.4000 7.0000 7.9000 NA ## 6 <NA> NA NA NA NA ## euprofiler ees castles_mair huber_inglehart ray benoit_laver chess ## 1 120 1276320 803 1503 302 6087 302 ## 2 NA NA NA NA NA NA NA ## 3 NA NA NA NA NA NA NA ## 4 NA NA NA NA NA NA NA ## 5 NA NA NA NA NA NA NA ## 6 NA NA NA NA NA NA NA ## family_id ## 1 11 ## 2 6 ## 3 3 ## 4 40 ## 5 26 ## 6 NA ``` ] .font70[.hl[NOTE] Be careful if variables are named the same but measure different things.] --- # Alternative approaches - For every tidyverse function ("verb"), there is, of course, a base R way to do it. - There are alternatives. - For instance, the [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) and [collapse](https://raw.githubusercontent.com/SebKrantz/cheatsheets/master/collapse.pdf) (also comes with fast versions of summary stats and models) package provide great and fast data wrangling alternatives. - Data.table syntax is closer to the base R way of indexing/manipulating data frames. Some like that. - Don't be dogmatic. Use whatever suits you and your context. Mix stuff. - You can find a great comparison of `data.table` and `dplyr` [here](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/). --- # A glimpse at `data.table` - Comes with its own interpretation of data frames, "data tables". Special structure to work faster. - Looks similar to basic `df[]` indexing but with alot of twists. - Three elements: which observations/rows COMMA transformations or other functions COMMA grouping. Rough `dplyr` equivalent: <center> DT[.hl[`slice(); filter(); arrange()`], .hl2[`select(); mutate()`], `group_by()`] <center/> --- # A glimpse at `data.table` Example: .pull-left[ ```r parlgov_elec_de %>% # add, e.g., _de if we want to keep our original df filter(party_name_short == "SPD") %>% summarise(mean(vote_share, na.rm = T)) ``` ``` ## mean(vote_share, na.rm = T) ## 1 31.12622 ``` ] .pull-right[ ```r setDT(parlgov_elec_de) parlgov_elec_de[party_name_short == "SPD", mean(vote_share, na.rm = T)] ``` ``` ## [1] 31.12622 ``` ] --- # Problem Set 03/"Homework" - This last problem set may take a bit longer. - Try to get as far as you can in the remaining time and finish the rest at home (if you want). --- class: inverse, center, middle name: intro # Next Up: Data Viz